Data Mining Methodological Weaknesses and Suggested Fixes

Author

  • John H. Maindonald
Abstract

Predictive accuracy claims should give explicit descriptions of the steps followed, with access to the code used. This allows referees and readers to check for common traps, and to repeat the same steps on other data. Feature selection and/or model selection and/or tuning must be independent of the test data. For use of cross-validation, such steps must be repeated at each fold. Even then, such accuracy assessments have the limitation that the target population, to which results will be applied, is commonly different from the source population. Commonly, it is shifted forward in time, and it may differ in other respects also. A consequence of source/target differences is that highly sophisticated modeling may be pointless or even counter-productive. At best, model effects in the target population may be broadly similar. Investigation of the pattern of changes over time is required. Such studies are unusual in the data mining literature, in part because relevant data have not been available. Several recent investigations are noted that shed interesting light on the comparison between observational and experimental studies, with particular relevance when there is an interest in giving parameter estimates a causal interpretation. Data mining activity would benefit from wider co-operation in the development and deployment of computing tools, and from better integration of those tools into the publication process.
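To illustrate the cross-validation point in the abstract, the Python sketch below contrasts feature selection repeated within each fold with selection done once on all the data, test rows included. It is not taken from the paper. The pure-noise data, the mean-gap selection rule, the nearest-centroid classifier, and every function name are assumptions chosen only to keep the example self-contained.

```python
import random
import statistics


def make_noise_data(n_samples=40, n_features=1000, seed=0):
    """Pure-noise features with balanced labels, so true accuracy is 50%."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y


def select_features(X, y, rows, k):
    """Pick the k features whose class means differ most, using only `rows`."""
    def gap(j):
        m1 = statistics.fmean([X[i][j] for i in rows if y[i] == 1])
        m0 = statistics.fmean([X[i][j] for i in rows if y[i] == 0])
        return abs(m1 - m0)
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]


def centroid_predict(X, y, train, test, feats):
    """Nearest-centroid classifier fitted on `train`, applied to `test`."""
    cents = {
        c: [statistics.fmean([X[i][j] for i in train if y[i] == c])
            for j in feats]
        for c in (0, 1)
    }
    preds = []
    for i in test:
        dist = {
            c: sum((X[i][j] - m) ** 2 for j, m in zip(feats, cents[c]))
            for c in (0, 1)
        }
        preds.append(min(dist, key=dist.get))
    return preds


def cv_accuracy(X, y, k_feats=10, n_folds=5, select_inside_folds=True):
    """Cross-validated accuracy, with selection inside or outside the folds."""
    n = len(X)
    all_rows = list(range(n))
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    leaked = None
    if not select_inside_folds:
        # Selection sees every row, including rows later used for testing.
        leaked = select_features(X, y, all_rows, k_feats)
    correct = 0
    for test in folds:
        train = [i for i in all_rows if i not in test]
        if select_inside_folds:
            # Selection is repeated at each fold on the training rows only.
            feats = select_features(X, y, train, k_feats)
        else:
            feats = leaked
        preds = centroid_predict(X, y, train, test, feats)
        correct += sum(p == y[i] for p, i in zip(preds, test))
    return correct / n


if __name__ == "__main__":
    X, y = make_noise_data()
    print("selection inside folds:", cv_accuracy(X, y, select_inside_folds=True))
    print("selection on all data: ", cv_accuracy(X, y, select_inside_folds=False))
```

Because the features carry no signal, the estimate with selection inside the folds should stay near chance, while the estimate with selection on all the data is typically much higher. The gap between the two is the optimism introduced by letting the test rows influence feature selection.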

Similar resources

Constructing and Comparing User Mobility Profiles

Nowadays, the accumulation of people’s whereabouts due to location-based applications has made it possible to construct their mobility profiles. This access to users’ mobility profiles subsequently brings benefits back to location-based applications. For instance, in on-line social networks, friends can be recommended not only based on the similarity between their registered information, e.g., ...

Revisiting data mining: ‘hunting’ with or without a license

The primary objective of this paper is to revisit a number of empirical modeling activities which are often characterized as data mining, in an attempt to distinguish between the problematic and the non-problematic cases. The key for this distinction is provided by the notion of severity proposed by Mayo (1996). It is argued that many unwarranted data mining activities often arise because of in...

Opinion Mining, Social Networks, Higher Education

Background and Aim: With the advent of technology and the use of social networks such as Instagram, Facebook, blogs, forums, and many other platforms, interactions of learners with one another and their lecturers have become progressively relaxed. This has led to the accumulation of large quantities of data and information about students' attitudes, learning experiences, opinions, and f...

SMT error analysis and mapping to syntactic, semantic and structural fixes

This paper argues in favor of a linguistically informed error classification for SMT to identify system weaknesses and map them to possible syntactic, semantic and structural fixes. We propose a scheme which includes both linguistic-oriented error categories as well as SMT-oriented edit errors, and evaluate an English-Spanish system and an English-Basque system developed for a Q&A scenario in th...

Chained System: A Linear Combination of Different Types of Statistical Machine Translation Systems

The paper explores a way to learn post-editing fixes of raw MT outputs automatically by combining two different types of statistical machine translation (SMT) systems in a linear fashion. Our proposed system (which we call a chained system) consists of two SMT systems: (i) a syntax-based SMT system and (ii) a phrase-based SMT system (Koehn, 2004). We first translate source sentences of the bite...

Publication date: 2006